Examining the effect of evaluation sample size on the sensitivity and specificity of COVID-19 diagnostic tests in practice: a simulation study

Background: In response to the global COVID-19 pandemic, many in vitro diagnostic (IVD) tests for SARS-CoV-2 have been developed. Given the urgent clinical demand, researchers must balance the desire for precise estimates of sensitivity and specificity against the need for rapid implementation. To complement estimates of precision used for sample size calculations, we aimed to estimate the probability that an IVD will fail to perform to expected standards after implementation, following clinical studies with varying sample sizes.

Methods: We assumed that clinical validation study estimates met the ‘desirable’ performance (sensitivity 97%, specificity 99%) in the target product profile (TPP) published by the Medicines and Healthcare products Regulatory Agency (MHRA). To estimate the real-world impact of imprecision imposed by sample size, we used Bayesian posterior calculations along with Monte Carlo simulations with 10,000 independent iterations of 5,000 participants. We varied the prevalence between 1% and 15% and the sample size between 30 and 2,000. For each sample size, we estimated the probability that diagnostic accuracy would fail to meet the TPP criteria after implementation.

Results: For a validation study that demonstrates ‘desirable’ sensitivity within a sample of 30 participants who test positive for COVID-19 using the reference standard, the probability that real-world performance will fail to meet the ‘desirable’ criteria is 10.7–13.5%, depending on prevalence. Theoretically, demonstrating the ‘desirable’ performance in 90 positive participants would reduce that probability to below 5%. A marked reduction in the probability of failing to achieve ‘desirable’ specificity occurred between samples of 100 (19.1–21.5%) and 160 (4.3–4.8%) negative participants. There was little further improvement above sample sizes of 160 negative participants.

Conclusion: Based on imprecision alone, small evaluation studies can lead to the acceptance of diagnostic tests that are likely to fail to meet performance targets when deployed. There are diminishing returns in the reduction of uncertainty surrounding an accuracy estimate above a total sample size of 250 (90 positive and 160 negative).

Supplementary Information: The online version contains supplementary material available at 10.1186/s41512-021-00116-4.
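The core calculation described in the Methods can be sketched as follows. This is a minimal illustrative Python sketch, not the authors' code: it assumes a uniform Beta(1, 1) prior on sensitivity and specificity, derives evaluation counts by rounding the TPP point estimates, and judges real-world failure against the lower 95% CI bounds of the desirable TPP (sensitivity 93%, specificity 97%) described for Figure S2. The function name prob_fail and its defaults are hypothetical.

import numpy as np

rng = np.random.default_rng(42)

def prob_fail(n_pos, n_neg, sens=0.97, spec=0.99,
              sens_floor=0.93, spec_floor=0.97,
              prevalence=0.05, n_iter=10_000, n_world=5_000):
    # Evaluation counts implied by (rounding) the observed point estimates.
    tp = round(sens * n_pos)   # true positives in the evaluation sample
    tn = round(spec * n_neg)   # true negatives in the evaluation sample

    # Beta posterior draws for the unknown true accuracy (Beta(1, 1) prior).
    true_sens = rng.beta(tp + 1, n_pos - tp + 1, size=n_iter)
    true_spec = rng.beta(tn + 1, n_neg - tn + 1, size=n_iter)

    # One simulated deployment cohort of n_world participants per iteration.
    n_dis = rng.binomial(n_world, prevalence, size=n_iter)
    obs_sens = rng.binomial(n_dis, true_sens) / np.maximum(n_dis, 1)
    obs_spec = rng.binomial(n_world - n_dis, true_spec) / (n_world - n_dis)

    # Proportion of iterations whose real-world accuracy misses either floor.
    return np.mean((obs_sens < sens_floor) | (obs_spec < spec_floor))

# Desirable TPP demonstrated in 30 positives and 100 negatives, 5% prevalence:
print(prob_fail(n_pos=30, n_neg=100))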


Supplementary Results
The FALCON-C19 study's Moonshot evaluation lasted 6 weeks across 15 test-and-trace sites and recruited 880 COVID-19-positive participants. The point-of-care (POC) evaluation lasted 11 weeks across 7 sites and recruited 403 negative and 118 positive participants. Both hospital sample-collection evaluations involved collecting specimens from patients, which were then sent to external laboratories for evaluation on novel technologies offsite: evaluation A ran for 17 weeks across six sites and recruited 94 positive and 147 negative cases, and evaluation B ran for 17 weeks across seven sites and recruited 65 positive and 73 negative participants. The recruitment rates for positive and negative cases across the different evaluations are visualised in Figure S4. Moonshot demonstrated a much faster rate of positive case recruitment than any other evaluation. Consequently, it outperformed the other evaluations in time to completion, whereas the hospital sample-collection evaluations would require considerably more time, more than 5 years, to complete the largest sample size (Table S4). Interestingly, the other evaluations saw only a slight increase in recruitment rate with increasing national prevalence of COVID-19.
Supplementary Figure S1: Reliability and Wilson 95% lower bound across observed outcomes in the evaluation study, for each of the target product profiles, for evaluation sizes of 30, 150 and 250.
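For reference, the Wilson 95% lower bound plotted in Figure S1 is a standard formula for a binomial proportion; a minimal Python sketch is below (the function name wilson_lower is illustrative).

from math import sqrt

def wilson_lower(successes, n, z=1.96):
    # 95% Wilson score lower bound for a binomial proportion.
    p = successes / n
    denom = 1 + z**2 / n
    centre = p + z**2 / (2 * n)
    margin = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return (centre - margin) / denom

# e.g. 29 true positives out of 30 reference-standard positives:
print(round(wilson_lower(29, 30), 3))  # ~0.833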
Supplementary Figure S2: Scenario estimates of real-world diagnostic accuracy from a series of Monte Carlo simulations per evaluation sample size. Each simulation consisted of 10,000 iterations, each comprising 5,000 individuals. Here the diagnostic test was assumed to achieve the minimum performance for the desirable MHRA TPP (97% sensitivity and 99% specificity) in the evaluation sample. The confidence intervals are displayed for sensitivity and specificity per initial evaluation sample size across different prevalence scenarios. The simulation was considered to have met the TPP confidence interval criterion if the diagnostic characteristic was above the lower 95% CI bound (sensitivity 93%; specificity 97%), for sample sizes between 30 and 100.
Supplementary Figure S3: Scenario estimates of real-world diagnostic accuracy from a series of Monte Carlo simulations per evaluation sample size. Each simulation consisted of 10,000 iterations, each comprising 5,000 individuals. Here the diagnostic test was assumed to achieve the minimum performance for the acceptable MHRA TPP (80% sensitivity and 95% specificity) in the evaluation sample. The confidence intervals are displayed for sensitivity and specificity per initial evaluation sample size across different prevalence scenarios. The simulation was considered to have met the TPP confidence interval criterion if the diagnostic characteristic was above the lower 95% CI bound (sensitivity 70%; specificity 90%), for sample sizes between 30 and 100.
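Under the same assumptions as the simulation sketch above, the Figure S3 scenario corresponds to calling prob_fail with the acceptable TPP point estimates and lower CI bounds; the prevalences and evaluation sample size below are illustrative.

# Acceptable TPP scenario (as in Figure S3): point estimates 80% / 95%,
# failure judged against the lower CI bounds of 70% / 90%.
for prev in (0.01, 0.05, 0.10, 0.15):
    p = prob_fail(n_pos=100, n_neg=100, sens=0.80, spec=0.95,
                  sens_floor=0.70, spec_floor=0.90, prevalence=prev)
    print(f"prevalence {prev:.0%}: P(fail) = {p:.3f}")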
Supplementary Figure S4: Cumulative recruitment per evaluation compared to prevalence according to the Office for National Statistics COVID-19 prevalence estimate [2]. Moonshot employed a community positive COVID-19 recall strategy based in NHS Test and Trace centres [1]. The point-of-care (POC) evaluation was hospital-based, with the technology deployed at the patient's bedside. Evaluations A and B were hospital-based evaluations with sample collection only, where the samples were run offsite.
Supplementary Figure S5: Regions of the probability of failure to achieve the threshold sensitivity (thresholds: 50%, 70%, 80%, 93%) for an observed number of false negatives in a given evaluation sample size, with an assumed prevalence (1%, 5%, 10%, 15%), in a real-world simulation of size 5,000.
Supplementary Figure S6: Regions of the probability of failure to achieve the threshold specificity (thresholds: 90%, 93%, 95%, 97%) for an observed number of false positives in a given evaluation sample size, with an assumed prevalence (1%, 5%, 10%, 15%), in a real-world simulation of size 5,000.
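The Bayesian ingredient behind these failure regions can be sketched briefly. Assuming a uniform Beta(1, 1) prior, k false negatives among n reference-standard positives give a Beta(n − k + 1, k + 1) posterior for sensitivity, and the probability of failing a threshold t is the posterior CDF at t; the additional sampling variation of the 5,000-participant real-world cohort is layered on as in the simulation sketch above. The function name p_fail_threshold is illustrative.

from scipy.stats import beta

def p_fail_threshold(n_pos, false_neg, threshold):
    # Posterior P(true sensitivity < threshold) under a Beta(1, 1) prior,
    # given false_neg missed cases among n_pos reference-standard positives.
    tp = n_pos - false_neg
    return beta.cdf(threshold, tp + 1, false_neg + 1)

# e.g. 1 false negative among 30 positives against the 93% threshold:
print(round(p_fail_threshold(30, 1, 0.93), 2))  # ~0.35

By symmetry, the same function covers the specificity regions of Figure S6 when false positives among reference-standard negatives are substituted.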
Supplementary Table S1: Estimated mean and 95% confidence intervals of the sensitivity and specificity, and the proportion that failed to meet the lower bound of the acceptable TPP criteria confidence interval (sensitivity 70%; specificity 90%), assuming the test achieved the minimum performance for the acceptable TPP (80% sensitivity and 95% specificity) in the evaluation sample, for a simulated population size of 5,000 (Simulation Setting 2).